Observational study;
Causal Inference;
Matching;
R;
library(DOS) # Design of Observational Studies
library(tidyverse)
library(kableExtra)
library(optmatch)
Multivariate matching is involved in 3 steps:
(i) creating a distance matrix,
(ii) adding a propensity score caliper to the distance matrix, and
(iii) finding an optimal match.
Matching in R created largely due to the efforts of Ben Hansen (B. B. Hansen 2007), who created an R function, fullmatch, to do optimal matching from Fortan code created by Demetri Bertsekas D. P. Bertsekas (1981).
Does financial aid increase college attendance?
Four matched samples will be constructed: The construction of one of the four matched samples is presented in step-by-step detail.
During 1979-1981
Social Security Student Benefit Program provided a substantial
tuition benefit for a child of a deceased father, before the
cancellation in 1982.
Analyzing the impact on college attendance and completed schooling of the elimination of the Social Security Student Benefit Program in 1982
From 1965 to 1982, the Social Security Administration paid for millions of students to go to college. Under this program, the 18- to 22-year-old children of deceased, disabled, or retired Social Security beneficiaries received monthly payments while enrolled full time in college.
In 1981, Congress voted to eliminate the program. Then was cancelled in 1982
The program’s demise provides an opportunity to measure the incentive effects of financial aid. Using difference-in-differences methodology, and proxying for benefit eligibility with the death of a parent during an individual’s childhood, found that ? the elimination of the Social Security student benefit program reduced college attendance probabilities
Data Xb
with 2820 rows and 8 columns:
- faminc
: family income in units of $10,000
- incmiss
: income missing (1 if family income is missing, 0
o.w.)
- black
: 1 if black, 0 o.w.
- hispanic
: 1 if hispanic, 0 o.w.
- afqtpct
: the Armed Forces Qualifications Test
- edm
: mother’s education (1 for less than high school, 2
for high school, 3 for some college, 4 for BA degree or more)
- edmissm
: mother’s education missing (1 if missing, 0
o.w.)
- female
: gender (1 for female, 0 for male)
Imputation:
faminc
is set to 2 whenincmiss
= 1 indicating thatfaminc
is missing, andedm
is set to 0 whenedmissm
=1 indicating thatedm
is missing
The treatmentAFQT was missing for less than 2% of subjects, and these subjects are not used in the matching. With this restriction, there are 131 high school seniors with deceased fathers and 2689 other high school seniors in the 1979-1981 cohort (131+2689 = 2820)
zb
, where is 1 if the father is deceased, and
0 o.w.data("dynarski")
dim(dynarski)
[1] 2820 10
kable(head(dynarski, n=20), caption="First 20 people in the 1979-1981 cohort, the treatment zb and the eight covariates Xb") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
id | zb | faminc | incmiss | black | hisp | afqtpct | edmissm | edm | female |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 5.3497560 | 0 | 0 | 0 | 71.901660 | 0 | 2 | 0 |
2 | 0 | 2.4627060 | 0 | 0 | 0 | 95.158050 | 0 | 2 | 1 |
4 | 0 | 9.4876030 | 0 | 0 | 0 | 95.785250 | 0 | 2 | 1 |
5 | 0 | 0.4388592 | 0 | 0 | 0 | 90.617160 | 0 | 2 | 1 |
6 | 0 | 10.4363600 | 0 | 0 | 0 | 81.761160 | 0 | 2 | 0 |
7 | 0 | 6.2694180 | 0 | 0 | 0 | 97.917710 | 0 | 4 | 0 |
10 | 1 | 3.2204620 | 0 | 0 | 0 | 61.916710 | 0 | 2 | 1 |
11 | 0 | 9.4719470 | 0 | 0 | 0 | 20.446560 | 0 | 2 | 1 |
12 | 0 | 2.7167480 | 0 | 0 | 0 | 57.175110 | 0 | 1 | 0 |
14 | 0 | 2.0000000 | 1 | 0 | 0 | 8.278976 | 0 | 1 | 1 |
15 | 0 | 7.1157020 | 0 | 0 | 0 | 50.928250 | 0 | 2 | 1 |
16 | 0 | 7.7669970 | 0 | 0 | 0 | 71.149020 | 0 | 2 | 1 |
17 | 0 | 12.5388400 | 0 | 0 | 0 | 74.912190 | 0 | 2 | 0 |
18 | 0 | 2.0000000 | 1 | 0 | 0 | 98.394380 | 0 | 2 | 1 |
21 | 0 | 9.4041260 | 0 | 0 | 0 | 98.720520 | 0 | 4 | 1 |
23 | 0 | 2.0000000 | 1 | 0 | 0 | 99.724040 | 0 | 4 | 0 |
24 | 0 | 2.0000000 | 1 | 0 | 0 | 73.281490 | 0 | 3 | 1 |
29 | 0 | 2.0000000 | 1 | 1 | 0 | 1.781234 | 0 | 1 | 0 |
31 | 0 | 2.0000000 | 1 | 0 | 0 | 42.900150 | 0 | 2 | 1 |
34 | 1 | 2.6090910 | 0 | 0 | 0 | 99.623680 | 0 | 2 | 0 |
#write.csv(dynarski, file="dynarski.csv")
<-dynarski$zb
zb<-factor(zb,levels=c(1,0),labels=c("Father deceased","Father not deceased"))
zbftable(zbf)
zbf
Father deceased Father not deceased
131 2689
<-dynarski[,3:10] Xb
\(\Rightarrow\) The 131 seniors with deceased fathers will each be matched to 10 controls whose fathers were not deceased.
with(dynarski, table(incmiss))
incmiss
0 1
2282 538
<-round(table(dynarski$incmiss)[2]/dim(dynarski)[1]*100,1)
incmiss.perpaste0("Percentage of income missing: ", incmiss.per)
[1] "Percentage of income missing: 19.1"
with(dynarski, table(edmissm))
edmissm
0 1
2704 116
<-round(table(dynarski$edmissm)[2]/dim(dynarski)[1]*100,1)
edmissm.perpaste0("Percentage of mother's education missing: ", edmissm.per)
[1] "Percentage of mother's education missing: 4.1"
Discussion of the missing covariate values will be later on another post.
# Estimate the propensity score
<- glm(zb ~ Xb$faminc + Xb$incmiss + Xb$black + Xb$hisp + Xb$afqtpct + Xb$edmissm + Xb$edm + Xb$female, family = binomial)$fitted.values p
\(\Rightarrow\) The vector
p
contains the 2820 estimated propensity scores, \(\hat{e}(\mathbf{x}_l)\), \(l\) = 1, 2, …, 2820.
Estimated propensity scores for 131 seniors with deceased fathers and 2689 other seniors in the 1979-1981 cohort, before the Social Security Student Benefit Program was eliminated.It is reasonable to be able to improve the model: perhaps including interaction terms or transformations or polynomials or whatnot.
boxplot(p~zbf, ylab="Propensity score", main="1979-1981 Cohort")
$p <- p
dynarskiggplot(dynarski, aes(p, fill = zbf)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') +
ggtitle("Figure of Overlay Histograms of PS in 2 Groups") +
theme_bw()
\(\Longrightarrow\) the median \(\hat{e}(\mathbf{x})\) in the treated group (0.064) is about equal to the upper quartile among potential controls (0.063). Nonetheless, the distributions in Figure of Overlay Histogram of PS in 2 Groups overlap substantially, so matching appears to be possible.
#Robust Mahalanobis distance matrix, treated x control
<-smahal(zb,Xb)
dmatdim(dmat)
[1] 131 2689
kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix using the rank-based Mahalanobis distance") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
7 | 3.86 | 2.61 | 5.06 | 6.86 | 6.72 |
20 | 3.47 | 3.03 | 7.58 | 6.23 | 5.82 |
108 | 9.60 | 19.47 | 20.02 | 24.62 | 13.03 |
126 | 6.81 | 8.05 | 12.93 | 10.74 | 9.88 |
145 | 8.70 | 15.09 | 17.74 | 18.86 | 12.37 |
#Add a caliper on the propensity score using a penalty function
<-addcaliper(dmat,zb,p,caliper=.2)
dmatdim(dmat)
[1] 131 2689
kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix after adding the propensity score calipers") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
7 | 18.60 | 20.64 | 42.04 | 79.91 | 46.10 |
20 | 46.32 | 3.03 | 72.66 | 51.18 | 73.30 |
108 | 82.94 | 47.40 | 115.60 | 39.07 | 111.01 |
126 | 57.81 | 13.64 | 86.16 | 47.54 | 85.51 |
145 | 8.70 | 54.51 | 33.32 | 113.31 | 30.34 |
\(\Rightarrow\) Among these 25 entries, only 2 respected the caliper and did not incur a penalty (3.03 and 8.70).
Matching ten controls to each senior with a deceased father: The
fullmatch
function needs to know the distance matrix, here
dmat
, matching 10-to-1 means 10x131 will be used, omitting
2689−10x131 = 1379, which is the omit fraction be: 51%.
<-fullmatch(dmat, data=dynarski, min.controls=10, max.controls=10, omit.fraction=1379/2689) m
There is an entry in m for each of the 2820 seniors
length(m)
[1] 2820
The first ten entries in m are
1:10] m[
1 2 3 4 5 6 7 8 9 10
1.129 1.100 <NA> 1.54 <NA> 1.121 1.114 <NA> 1.111 1.87
i.e. the first senior of the 2820 seniors is in matched set #34 and the second senior is in matched set #10. The third senior was one of the 1379 unmatched controls; this is the meaning of the zero in m.01. The fourth senior is in matched set #87, the fifth is unmatched, and so on.
The function matched(·) indicates who is matched. The first ten entries of matched(m) are
matched(m)[1:10]
1 2 3 4 5 6 7 8 9 10
TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
the first two seniors were matched but the third was not, and so on.
There are 1441 matched seniors, where 1441 = 131×11, because the 131 seniors with deceased fathers were each matched to ten controls, making 131 matched sets each of size 11.
sum(matched(m))
[1] 1441
sum(matched(m))/11
[1] 131
The first 3 matched sets are in Table below. The first match consists of 11 female high school seniors, neither black nor hispanic, whose mothers had a high school education, with family incomes between $30,000 and $40,000, mostly with test scores between 59% and 77%. In the second matched set, incomes were lower but test scores were higher. And so on.
# Housekeeping
<-as.integer(m) # matched set from 1 to 131
im<-cbind(dynarski,im)
dynarski<-dynarski[matched(m),] # only select matched cases, note that matched(m): T\F
dm<-dm[order(dm$im,1-dm$zb),] # sort data according to matched set then zb decreasing
dm
# Table of matched set example
#which(dm$id==10) # [1] 188
#which(dm$id==396) # [1] 23
#which(dm$id==3051) # [1] 1068
kable(rbind(dm[188:198,-c(11,12)], dm[23:33,-c(11,12)],dm[1068:1078,-c(11,12)]), caption="The 3 of 131 matched sets, each set containing one treated subject and 10 matched controls") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
id | zb | faminc | incmiss | black | hisp | afqtpct | edmissm | edm | female | |
---|---|---|---|---|---|---|---|---|---|---|
7 | 10 | 1 | 3.220462 | 0 | 0 | 0 | 61.91671 | 0 | 2 | 1 |
239 | 350 | 0 | 3.557851 | 0 | 0 | 0 | 59.58355 | 0 | 2 | 1 |
252 | 365 | 0 | 3.599340 | 0 | 0 | 0 | 61.23934 | 0 | 2 | 1 |
322 | 465 | 0 | 3.557851 | 0 | 0 | 0 | 56.37230 | 0 | 2 | 1 |
361 | 518 | 0 | 3.788779 | 0 | 0 | 0 | 67.00953 | 0 | 2 | 1 |
383 | 550 | 0 | 3.788779 | 0 | 0 | 0 | 63.49724 | 0 | 2 | 1 |
928 | 1294 | 0 | 3.557851 | 0 | 0 | 0 | 67.31059 | 0 | 2 | 1 |
1481 | 2072 | 0 | 3.788779 | 0 | 0 | 0 | 64.97742 | 0 | 2 | 1 |
1489 | 2082 | 0 | 3.788779 | 0 | 0 | 0 | 63.62268 | 0 | 2 | 1 |
1560 | 2183 | 0 | 3.970631 | 0 | 0 | 0 | 76.51781 | 0 | 2 | 1 |
2810 | 3965 | 0 | 3.761650 | 0 | 0 | 0 | 72.57903 | 0 | 2 | 1 |
268 | 396 | 1 | 2.371901 | 0 | 0 | 0 | 88.50979 | 0 | 2 | 1 |
2 | 2 | 0 | 2.462706 | 0 | 0 | 0 | 95.15805 | 0 | 2 | 1 |
101 | 147 | 0 | 2.273267 | 0 | 0 | 0 | 77.09483 | 0 | 2 | 1 |
375 | 537 | 0 | 2.595314 | 0 | 0 | 0 | 95.96086 | 0 | 2 | 1 |
665 | 933 | 0 | 2.846281 | 0 | 0 | 0 | 96.11139 | 0 | 2 | 1 |
698 | 974 | 0 | 1.897521 | 0 | 0 | 0 | 99.59859 | 0 | 3 | 1 |
707 | 987 | 0 | 2.134711 | 0 | 0 | 0 | 81.18414 | 0 | 2 | 1 |
1394 | 1947 | 0 | 2.050298 | 0 | 0 | 0 | 91.44506 | 0 | 3 | 1 |
1518 | 2124 | 0 | 2.298786 | 0 | 0 | 0 | 72.40341 | 0 | 2 | 1 |
1877 | 2618 | 0 | 2.211560 | 0 | 0 | 0 | 68.91621 | 0 | 2 | 1 |
2816 | 3975 | 0 | 2.371901 | 0 | 0 | 0 | 90.74260 | 0 | 2 | 1 |
2182 | 3051 | 1 | 3.409901 | 0 | 0 | 1 | 62.87004 | 0 | 1 | 0 |
427 | 606 | 0 | 4.179612 | 0 | 0 | 1 | 81.73608 | 0 | 1 | 0 |
472 | 664 | 0 | 4.388592 | 0 | 0 | 1 | 91.57050 | 0 | 1 | 0 |
628 | 884 | 0 | 2.846281 | 0 | 0 | 1 | 48.77070 | 0 | 1 | 0 |
713 | 995 | 0 | 3.134709 | 0 | 0 | 1 | 55.11791 | 0 | 1 | 0 |
724 | 1008 | 0 | 3.320661 | 0 | 0 | 1 | 51.60562 | 0 | 1 | 0 |
1006 | 1399 | 0 | 3.439256 | 0 | 0 | 1 | 90.01505 | 0 | 2 | 0 |
2087 | 2908 | 0 | 3.795041 | 0 | 0 | 1 | 80.28098 | 0 | 1 | 0 |
2328 | 3262 | 0 | 3.788779 | 0 | 0 | 1 | 57.27546 | 0 | 1 | 0 |
2424 | 3400 | 0 | 2.925728 | 0 | 0 | 1 | 80.88309 | 0 | 2 | 0 |
2576 | 3624 | 0 | 3.320661 | 0 | 0 | 1 | 74.08430 | 0 | 1 | 1 |
Note that: for covariate \(k\),
- \(\bar{x}_{tk}\) is the mean in the
first group in the comparison,
- \(\bar{x}_{ck}\) is the mean in the
second group in the comparison, and
- \(\bar{x}_{mck}\) is the mean in the
matched comparison group formed from the second group
- \(sd_{bk}\) is the standardized
difference in means before matching and
- \(sd_{mk}\) is the standardized
difference in means after matching; they have the same denominator but
different numerators
#fd <- dynarski[dynarski$zb==1,-c(1,2,12)]
#round(apply(fd, 2, FUN=mean),2)
<-dynarski %>%
d.tkfilter(zb==1) %>%
mutate(faminc=ifelse(incmiss==1,NA, faminc),
edm=ifelse(edmissm==1, NA, edm)) %>%
select(-id,-zb,-im)
<-d.tk %>% summarise_all(mean, na.rm=TRUE)
x.tk<-d.tk %>% summarise_all(sd, na.rm=TRUE)
s.tk
<-dynarski %>%
d.ckfilter(zb==0) %>%
mutate(faminc=ifelse(incmiss==1,NA, faminc),
edm=ifelse(edmissm==1, NA, edm)) %>%
select(-id,-zb,-im)
<-d.ck %>% summarise_all(mean, na.rm=TRUE)
x.ck<-d.ck %>% summarise_all(sd, na.rm=TRUE)
s.ck
<-abs(x.tk-x.ck)/sqrt((s.tk^2+s.ck^2)/2)
sd.bk
<-dm %>%
d.cmkfilter(zb==0) %>%
mutate(faminc=ifelse(incmiss==1,NA, faminc),
edm=ifelse(edmissm==1, NA, edm)) %>%
select(-id,-zb,-im)
<-d.cmk %>% summarise_all(mean, na.rm=TRUE)
x.cmk<-abs(x.tk-x.cmk)/sqrt((s.tk^2+s.ck^2)/2)
sd.cmk
<-round(rbind(x.tk,x.cmk,x.ck,sd.bk,sd.cmk),2)
setrow.names(set) <- c("x.tk","x.cmk","x.ck","sd.bk","sd.cmk")
kable(set, caption="Balance on covariates before and after matching") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
faminc | incmiss | black | hisp | afqtpct | edmissm | edm | female | p | |
---|---|---|---|---|---|---|---|---|---|
x.tk | 2.78 | 0.15 | 0.35 | 0.15 | 49.58 | 0.08 | 1.62 | 0.49 | 0.07 |
x.cmk | 2.77 | 0.15 | 0.34 | 0.15 | 49.10 | 0.04 | 1.61 | 0.50 | 0.06 |
x.ck | 4.58 | 0.19 | 0.29 | 0.15 | 52.39 | 0.04 | 1.91 | 0.50 | 0.05 |
sd.bk | 0.71 | 0.11 | 0.13 | 0.02 | 0.10 | 0.19 | 0.33 | 0.03 | 0.67 |
sd.cmk | 0.00 | 0.00 | 0.02 | 0.03 | 0.02 | 0.19 | 0.02 | 0.02 | 0.09 |
\(\Longrightarrow\) They were quite different before matching but were much closer after matching. The family income of seniors with deceased fathers was lower, they were more often black, and their mothers had less education. Between 1979-1981, AFQT test scores decline. The imbalances are much smaller after matching.
The outcomes were not provided in the Dynarski’s dataset. But I guessed we can report the proportions and MH-odds ratio between 2 groups after matching.
In short, during 1979-1981, when the Social Security Student Benefit
Program provided tuition aid to students of a deceased Social Security
beneficiary, seniors with deceased fathers were more likely to
attend college and complete one year of college than were
matched controls, with an odds ratio of about 1.65
, but
there is no sign of this in 1982-1983 after the program ended (the
later can be checked because of lack the data for analysis).
Standardized differences in covariate means before and after matching in matched comparisons. The boxplot displays standardized differences in means, for the nine covariates and the propensity score.
<-c(0,1)
m<-rbind(sd.bk,sd.cmk)
d.sd<-cbind(d.sd,m)
d.sd<-d.sd %>%
d.sd.longpivot_longer(-m,names_to="absstdzdiff", values_to="sd")
$m<-factor(d.sd.long$m,levels=c(0,1),labels=c("Unmatched","Matched"))
d.sd.longboxplot(sd~m, data=d.sd.long, xlab="", ylab="Absolute Standardized Difference", main="1979-1981 Cohort: FD vs. FND")
fullmatch
functionfullmatch
is doingproc assign
in SASIf you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2021, July 3). HaiBiostat: Matching in R. Retrieved from https://hai-mn.github.io/posts/2021-07-03-matching in R-causal-inference/
BibTeX citation
@misc{nguyen2021matching, author = {Nguyen, Hai}, title = {HaiBiostat: Matching in R}, url = {https://hai-mn.github.io/posts/2021-07-03-matching in R-causal-inference/}, year = {2021} }